String Processing in R

Author

Martin Schweinberger

Introduction

This tutorial introduces string processing in R — the art of manipulating, searching, extracting, and transforming character data. String processing is a foundational skill for linguistic research: nearly every corpus study, text-mining project, or annotation pipeline begins with reading raw text and ends with cleaned, structured character data ready for analysis.

The tutorial is aimed at beginner and intermediate R users. It progresses from basic string operations in base R and the stringr package, through regular expressions and text-cleaning pipelines, to tokenisation with quanteda. Each section introduces functions with linguistic examples and includes worked exercises.

Prerequisite Tutorials

Before working through this tutorial, you should be familiar with basic R usage.

If you are new to R, work through Getting Started with R first.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Apply core base R string functions (nchar, paste, substr, gsub, grep, tolower, toupper)
  2. Use the full suite of stringr functions for detecting, extracting, replacing, splitting, padding, and combining strings
  3. Use str_glue() and str_glue_data() for string interpolation in reports and data pipelines
  4. Work with factors as strings using forcats — relabel, reorder, collapse, and filter factor levels
  5. Format strings for table output using padding, truncation, and number formatting
  6. Handle Unicode, encoding issues, and non-ASCII characters (IPA, non-Latin scripts)
  7. Write regular expressions including character classes, quantifiers, anchors, alternation, named capture groups, and lookahead/lookbehind
  8. Build reproducible text-cleaning pipelines combining multiple string operations
  9. Tokenise text using quanteda and understand the difference between word, sentence, and character tokenisation
Citation

Schweinberger, Martin. 2026. String Processing in R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/string/string.html (Version 2026.02.24).


Interactive Notebook

An interactive, notebook-based version of this tutorial is available via the Binder link below. It allows you to upload your own texts, apply cleaning operations, and download the results without installing anything locally.

Click here to open the interactive string-processing notebook.


Setup

Installing Packages

Code
# Run once — comment out after installation
install.packages("tidyverse")   # stringr, dplyr, tidyr, purrr, ggplot2, forcats
install.packages("here")        # reproducible file paths
install.packages("flextable")   # formatted tables
install.packages("quanteda")    # tokenisation and corpus tools
install.packages("tm")          # text-mining utilities (stopwords, stemming)
install.packages("checkdown")   # interactive quiz questions
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Loading Packages

Code
library(tidyverse)   # loads stringr, dplyr, purrr, ggplot2, forcats
library(here)
library(flextable)
library(quanteda)
library(tm)
library(checkdown)
klippy::klippy()

Loading Example Texts

Throughout this tutorial we work with four example texts loaded from the LADAL data repository.

Code
# Text 1: paragraph about grammar (single string)
exampletext <- base::readRDS("tutorials/string/data/tx1.rda")

# Text 2: same paragraph split into sentences (character vector)
splitexampletext <- base::readRDS("tutorials/string/data/tx2.rda")

# Text 3: paragraph about Ferdinand de Saussure (single string)
additionaltext <- base::readRDS("tutorials/string/data/tx3.rda")

# Text 4: three short sentences (character vector)
sentences <- base::readRDS("tutorials/string/data/tx4.rda")

# Inspect
cat("exampletext (first 120 chars):\n", substr(exampletext, 1, 120), "\n\n")
exampletext (first 120 chars):
 Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply t 
Code
cat("splitexampletext:\n"); print(splitexampletext); cat("\n")
splitexampletext:
[1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
[2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         
Code
cat("sentences:\n"); print(sentences)
sentences:
[1] "This is a first sentence."     "This is a second sentence."   
[3] "And this is a third sentence."
Character Vectors in R

A character vector is R’s basic data structure for text. Each element is a separate string — exampletext is length 1 (one long string), while splitexampletext is length n (one element per sentence). Most stringr functions are vectorised: they accept vectors of any length and return a result of the same length, making it easy to process many strings at once.
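As a quick illustration of this vectorised behaviour (a minimal sketch using a small stand-in vector rather than the tutorial texts):

```r
library(stringr)

# one input element in, one result out: no loops required
greetings <- c("hello", "good morning", "hi there")

str_length(greetings)           # one length per element: 5 12 8
str_to_upper(greetings)         # case conversion applied element-wise
str_detect(greetings, "good")   # one TRUE/FALSE per element: FALSE TRUE FALSE
```

Every stringr call above returns a vector exactly as long as its input, which is what makes these functions safe to use inside mutate() and other column-wise operations.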


Base R String Functions

Section Overview

What you will learn: The most important string functions available in base R — no packages required. These underpin everything else and appear throughout code you will encounter in the wild.

Case Conversion

Code
tolower(exampletext) |> substr(1, 80)
[1] "grammar is a system of rules which governs the production and use of utterances "
Code
toupper(exampletext) |> substr(1, 80)
[1] "GRAMMAR IS A SYSTEM OF RULES WHICH GOVERNS THE PRODUCTION AND USE OF UTTERANCES "

String Length

Code
# Number of characters per element
nchar(splitexampletext)
[1] 100 295 126
Code
# NA-safe version
nchar(c("hello", NA, "world"), keepNA = TRUE)
[1]  5 NA  5

Substrings

Code
# Extract characters 1–60
substr(exampletext, 1, 60)
[1] "Grammar is a system of rules which governs the production an"
Code
# Replacement: overwrite a substring in-place
tmp <- exampletext
substr(tmp, 1, 7) <- "[REDACTED]"  # only positions 1-7 are overwritten; a longer replacement is truncated
substr(tmp, 1, 25)
[1] "[REDACT is a system of ru"

Combining Strings

Code
paste("Participant", 1:4, sep = "_")       # with separator
[1] "Participant_1" "Participant_2" "Participant_3" "Participant_4"
Code
paste0("Item", LETTERS[1:4])               # no separator
[1] "ItemA" "ItemB" "ItemC" "ItemD"
Code
paste(sentences, collapse = " | ")         # collapse vector to one string
[1] "This is a first sentence. | This is a second sentence. | And this is a third sentence."

Pattern Matching and Replacement

Code
# grep: indices of matching elements
grep("grammar", splitexampletext)
[1] 3
Code
# grepl: logical vector
grepl("grammar", splitexampletext)
[1] FALSE FALSE  TRUE
Code
# sub: replace FIRST match per string
sub("grammar", "GRAMMAR", exampletext) |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production and use of utterances "
Code
# gsub: replace ALL matches per string
gsub("\\band\\b", "&", exampletext) |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production & use of utterances in"
Code
# ignore.case
grep("grammar", splitexampletext, ignore.case = TRUE)
[1] 1 3
gsub() vs. str_replace_all()

Both replace all occurrences of a pattern. The key practical difference is argument order: gsub(pattern, replacement, string) puts the string last (inconvenient for pipes), while str_replace_all(string, pattern, replacement) puts the string first (pipe-friendly). For new code, prefer stringr. For reading legacy code, recognise gsub.
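A minimal side-by-side sketch of the two argument orders (txt is a throwaway example string):

```r
library(stringr)

txt <- "rules and sounds and meanings"

gsub("and", "&", txt)              # base R: pattern, replacement, string
str_replace_all(txt, "and", "&")   # stringr: string, pattern, replacement

# the string-first order slots naturally into a pipe
txt |> str_replace_all("and", "&")
```

All three calls return the same result; the difference only becomes visible once the call sits inside a longer pipeline.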

Splitting Strings

Code
# strsplit returns a LIST — one element per input string
words_list <- strsplit(exampletext, "\\s+")
head(words_list[[1]], 10)
 [1] "Grammar"    "is"         "a"          "system"     "of"        
 [6] "rules"      "which"      "governs"    "the"        "production"
Code
# Flatten to a plain vector
words_vec <- strsplit(exampletext, "\\s+")[[1]]
length(words_vec)
[1] 81

You have a character vector texts with 50 sentences. You want the indices of sentences that contain the word “the” (case-insensitive). Which call is correct?

  1. grep("the", texts, ignore.case = TRUE) — returns matching indices
  2. gsub("the", "", texts) — removes “the” from each sentence
  3. grepl("the", texts, ignore.case = TRUE) — returns a logical vector, not indices
  4. sub("the", "THE", texts) — replaces the first match only
Answer

a) grep("the", texts, ignore.case = TRUE)

grep() returns the positions (indices) of matching elements. grepl() (option c) is also useful but returns TRUE/FALSE — use it when filtering with texts[grepl(...)]. Options b and d perform replacements.


Core stringr Functions

Section Overview

What you will learn: The complete set of stringr functions for detecting, extracting, replacing, splitting, padding, ordering, and combining strings — all following the consistent str_verb(string, pattern) convention that makes them ideal for pipelines.

Detecting Patterns

Code
str_detect(splitexampletext, "grammar")           # logical vector
[1] FALSE FALSE  TRUE
Code
str_starts(splitexampletext, "[A-Z]")             # starts with capital
[1] TRUE TRUE TRUE
Code
str_ends(splitexampletext,   "\\.")               # ends with full stop
[1] TRUE TRUE TRUE
Code
str_which(splitexampletext,  "grammar")           # indices of matches
[1] 3
Code
str_count(exampletext, "\\band\\b")               # count occurrences
[1] 6

Extracting Patterns

Code
# First match per element
str_extract(splitexampletext, "\\b[A-Z][a-z]+\\b")
[1] "Grammar" "These"   "Many"   
Code
# All matches per element (returns a list)
str_extract_all(exampletext, "\\b[A-Z][a-z]+\\b")[[1]]
[1] "Grammar" "These"   "Many"    "Noam"    "Chomsky"
Code
# First match plus capture groups (matrix: col 1 = full match, col 2+ = groups)
str_match(exampletext, "\\bthe (\\w+)\\b")
     [,1]             [,2]        
[1,] "the production" "production"
Code
# All matches plus groups
str_match_all(exampletext, "\\bthe (\\w+)\\b")[[1]] |> head(5)
     [,1]               [,2]          
[1,] "the production"   "production"  
[2,] "the organisation" "organisation"
[3,] "the formation"    "formation"   
[4,] "the formation"    "formation"   
[5,] "the principles"   "principles"  

Replacing and Removing Patterns

Code
str_replace(exampletext, "grammar", "GRAMMAR") |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production and use of utterances "
Code
str_replace_all(exampletext, "\\band\\b", "&") |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production & use of utterances in"
Code
str_remove(exampletext, "\\bgrammar\\b") |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production and use of utterances "
Code
str_remove_all(exampletext, "[,;.]") |> substr(1, 80)
[1] "Grammar is a system of rules which governs the production and use of utterances "

Splitting Strings

Code
# str_split: returns a list
str_split(exampletext, "\\s+")[[1]] |> head(8)
[1] "Grammar" "is"      "a"       "system"  "of"      "rules"   "which"  
[8] "governs"
Code
# str_split_fixed: returns a matrix with exactly n columns
str_split_fixed(sentences, "\\s+", n = 3)
     [,1]   [,2]   [,3]                  
[1,] "This" "is"   "a first sentence."   
[2,] "This" "is"   "a second sentence."  
[3,] "And"  "this" "is a third sentence."
Code
# Split on sentence boundaries (lookbehind for .!?)
str_split(exampletext, "(?<=[.!?])\\s+")[[1]]
[1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
[2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         

Subsetting Strings

Code
str_sub(exampletext, 1, 60)                           # by character position
[1] "Grammar is a system of rules which governs the production an"
Code
str_subset(splitexampletext, "grammar|syntax")        # keep matching elements
[1] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[2] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         
Code
str_trunc(splitexampletext, width = 45)               # truncate with "..."
[1] "Grammar is a system of rules which governs..."
[2] "These rules apply to sound as well as mean..."
[3] "Many modern theories that deal with the pr..."

Padding, Whitespace, and Truncation

String formatting for table output, report generation, and aligned displays is one of the most practically useful areas of stringr.

Code
# str_trim: remove leading and trailing whitespace
messy <- "  This   has  extra   spaces.  "
str_trim(messy)
[1] "This   has  extra   spaces."
Code
# str_squish: remove leading/trailing AND internal runs of whitespace
str_squish(messy)
[1] "This has extra spaces."
Code
# str_pad: add characters to reach a target width
# Useful for aligning columns in plain-text reports
words_ex <- c("the", "corpus", "linguistics", "syntax")
str_pad(words_ex, width = 15, side = "right")   # left-aligned (pad right)
[1] "the            " "corpus         " "linguistics    " "syntax         "
Code
str_pad(words_ex, width = 15, side = "left")    # right-aligned (pad left)
[1] "            the" "         corpus" "    linguistics" "         syntax"
Code
str_pad(words_ex, width = 15, side = "both")    # centred
[1] "      the      " "    corpus     " "  linguistics  " "    syntax     "
Code
# Custom pad character (e.g. for dot-leaders in a table of contents)
str_pad(words_ex, width = 20, side = "right", pad = ".")
[1] "the................." "corpus.............." "linguistics........."
[4] "syntax.............."
Code
# str_trunc with different sides
str_trunc("A very long sentence about linguistics.", width = 25, side = "right")
[1] "A very long sentence a..."
Code
str_trunc("A very long sentence about linguistics.", width = 25, side = "left")
[1] "...nce about linguistics."
Code
str_trunc("A very long sentence about linguistics.", width = 25, side = "center")
[1] "A very long...inguistics."
Code
# Practical example: create an aligned plain-text frequency table
word_freqs <- data.frame(
  word = c("grammar", "syntax", "morphology", "phonology", "semantics"),
  freq = c(42, 38, 27, 19, 14),
  stringsAsFactors = FALSE
)

# Format for aligned display
word_freqs |>
  dplyr::mutate(
    word_padded = str_pad(word, width = 12, side = "right"),
    freq_padded = str_pad(as.character(freq), width = 6, side = "left"),
    pct         = round(100 * freq / sum(freq), 1),
    pct_padded  = str_pad(paste0(pct, "%"), width = 7, side = "left")
  ) |>
  dplyr::mutate(row = paste(word_padded, freq_padded, pct_padded)) |>
  dplyr::pull(row) |>
  (\(x) c("Word          Count     Pct",
           paste(rep("-", 27), collapse = ""),
           x))() |>
  cat(sep = "\n")
Word          Count     Pct
---------------------------
grammar          42     30%
syntax           38   27.1%
morphology       27   19.3%
phonology        19   13.6%
semantics        14     10%
Number Formatting with formatC() and sprintf()

For numeric string formatting, base R’s formatC() and sprintf() complement str_pad():

# Fixed decimal places
formatC(3.14159, digits = 3, format = "f")   # "3.142"

# Thousands separator
formatC(12345678, format = "d", big.mark = ",")  # "12,345,678"

# sprintf: C-style formatting
sprintf("Mean RT = %.1f ms (SD = %.1f)", 612.4, 87.3)

# Percentage formatting
sprintf("%.1f%%", 0.347 * 100)   # "34.7%"

Combining and Interpolating Strings

str_c() and str_flatten()

Code
# str_c: concatenate element-wise (NA-safe unlike paste0)
str_c("P", str_pad(1:5, 2, pad = "0"), sep = "")   # P01, P02, ...
[1] "P01" "P02" "P03" "P04" "P05"
Code
# str_c with NA: propagates NA (unlike paste0 which gives "NA")
str_c("prefix_", c("a", NA, "c"))
[1] "prefix_a" NA         "prefix_c"
Code
paste0("prefix_", c("a", NA, "c"))    # compare: NA becomes the string "prefix_NA"
[1] "prefix_a"  "prefix_NA" "prefix_c" 
Code
# str_flatten: collapse a vector to a single string
str_flatten(sentences, collapse = " ")
[1] "This is a first sentence. This is a second sentence. And this is a third sentence."
Code
str_flatten(c("cat", "dog", "bird"), collapse = ", ", last = " and ")
[1] "cat, dog and bird"

str_glue(): String Interpolation

str_glue() embeds R expressions directly in strings using {...} placeholders. This is far more readable than nested paste() calls and is the recommended approach for generating report text, axis labels, and data-driven narrative.

Code
# Basic interpolation
speaker  <- "P03"
n_tokens <- 1247
lang     <- "English"

str_glue("Speaker {speaker} (L1: {lang}) produced {n_tokens} tokens.")
Speaker P03 (L1: English) produced 1247 tokens.
Code
# Arithmetic inside {}
str_glue("Mean rate: {round(n_tokens / 60, 1)} tokens per minute.")
Mean rate: 20.8 tokens per minute.
Code
# Conditional text
proficiency <- "Advanced"
str_glue("Speaker {speaker} is {tolower(proficiency)}.",
         " ",
         "Their token count was {ifelse(n_tokens > 1000, 'above', 'below')} 1,000.")
Speaker P03 is advanced. Their token count was above 1,000.
Code
# Multi-line glue (newlines are preserved unless you collapse)
str_glue(
  "--- Speaker Report ---\n",
  "ID:          {speaker}\n",
  "L1:          {lang}\n",
  "Tokens:      {n_tokens}\n",
  "Proficiency: {proficiency}"
)
--- Speaker Report ---
ID:          P03
L1:          English
Tokens:      1247
Proficiency: Advanced

str_glue_data(): Interpolation Over a Data Frame

str_glue_data() applies the template to every row of a data frame. This is ideal for generating per-participant summaries, axis labels, or APA-style results sentences.

Code
# Sample participant data
participants <- data.frame(
  id          = paste0("P", str_pad(1:6, 2, pad = "0")),
  l1          = c("English", "German", "French", "Japanese", "Spanish", "Mandarin"),
  tokens      = c(1247, 983, 1105, 876, 1031, 942),
  accuracy    = c(0.92, 0.87, 0.89, 0.84, 0.91, 0.86),
  proficiency = c("Advanced", "Intermediate", "Advanced",
                  "Intermediate", "Advanced", "Intermediate"),
  stringsAsFactors = FALSE
)

# Generate one summary sentence per participant
participants |>
  str_glue_data(
    "Speaker {id} (L1: {l1}, {proficiency}) produced {tokens} tokens ",
    "with {round(accuracy * 100, 1)}% accuracy."
  )
Speaker P01 (L1: English, Advanced) produced 1247 tokens with 92% accuracy.
Speaker P02 (L1: German, Intermediate) produced 983 tokens with 87% accuracy.
Speaker P03 (L1: French, Advanced) produced 1105 tokens with 89% accuracy.
Speaker P04 (L1: Japanese, Intermediate) produced 876 tokens with 84% accuracy.
Speaker P05 (L1: Spanish, Advanced) produced 1031 tokens with 91% accuracy.
Speaker P06 (L1: Mandarin, Intermediate) produced 942 tokens with 86% accuracy.
Code
# Generate APA-style result sentences for each comparison
results_df <- data.frame(
  comparison = c("Primed vs. Unprimed", "High- vs. Low-Frequency"),
  beta       = c(-0.082, -0.051),
  se         = c(0.018, 0.013),
  t_val      = c(-4.56, -3.92),
  p_val      = c(0.0001, 0.0009),
  stringsAsFactors = FALSE
)

results_df |>
  str_glue_data(
    "{comparison}: β = {round(beta, 3)}, SE = {round(se, 3)}, ",
    "t = {round(t_val, 2)}, p {ifelse(p_val < .001, '< .001', paste0('= ', round(p_val, 3)))}."
  )
Primed vs. Unprimed: β = -0.082, SE = 0.018, t = -4.56, p < .001.
High- vs. Low-Frequency: β = -0.051, SE = 0.013, t = -3.92, p < .001.
When to Use str_glue() vs. paste()

Use str_glue() whenever you have more than one or two variables to embed in a string. The {variable} syntax reads naturally as prose and supports arbitrary R expressions, while paste() becomes hard to read as the number of arguments grows. For vectorised row-by-row generation from a data frame, always prefer str_glue_data() over apply() + paste().
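To make the contrast concrete, here is the same sentence built both ways (speaker and tokens are made-up values):

```r
library(stringr)

speaker <- "P07"
tokens  <- 532

# paste0: workable with two variables, but the pieces fragment quickly
paste0("Speaker ", speaker, " produced ", tokens, " tokens.")

# str_glue: the template reads as prose
str_glue("Speaker {speaker} produced {tokens} tokens.")
```

Note that str_glue() returns a glue object; wrap it in as.character() when a plain character vector is required.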

Sorting and Ordering

Code
str_sort(sentences)                                    # default locale
[1] "And this is a third sentence." "This is a first sentence."    
[3] "This is a second sentence."   
Code
str_sort(sentences, decreasing = TRUE)
[1] "This is a second sentence."    "This is a first sentence."    
[3] "And this is a third sentence."
Code
# Locale matters for non-English alphabets
nordic <- c("ångström", "öl", "äpple", "banan", "citron")
str_sort(nordic)                     # incorrect for Swedish
[1] "ångström" "äpple"    "banan"    "citron"   "öl"      
Code
str_sort(nordic, locale = "sv")      # correct Swedish alphabetical order
[1] "banan"    "citron"   "ångström" "äpple"    "öl"      
Code
str_order(sentences)                 # returns ordering indices
[1] 3 1 2
Your turn!

Q2 You have an interview transcript and want to replace every occurrence of a participant’s real name (“Sarah”) with the pseudonym “P01”. Which stringr function is correct?





Q3 Which stringr functions manipulate whitespace? (Select all that apply.)







Working with Factors as Strings

Section Overview

What you will learn: How factors differ from character vectors; why factor level ordering matters for plots and models; and how to use forcats to relabel, reorder, collapse, and filter factor levels — tasks that arise constantly when working with categorical linguistic data (POS tags, speaker groups, genre labels, annotation codes).

Factors vs. Character Vectors

A factor is a categorical variable stored as integers with a character levels attribute. Factors are essential for:

  • Controlling the order of categories in plots (without factors, ggplot2 sorts alphabetically)
  • Setting reference levels in regression models
  • Summarising data by a fixed set of categories (including empty ones)
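The integer-plus-levels representation, and the reference-level mechanism mentioned above, can be inspected directly. A minimal sketch with an invented two-level grouping factor:

```r
# a factor is stored as integer codes plus a character levels attribute
grp <- factor(c("L2", "L1", "L2", "L1", "L1"))

as.integer(grp)   # the underlying codes: 2 1 2 1 1
levels(grp)       # "L1" "L2": the first level is the default model reference

# relevel() changes the reference level used by regression contrasts
grp2 <- relevel(grp, ref = "L2")
levels(grp2)      # "L2" "L1"
```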
Code
# Character vector vs. factor
pos_chars  <- c("NN", "VBZ", "DT", "NN", "JJ", "NN", "VBZ", "RB")
pos_factor <- factor(pos_chars,
                     levels = c("DT", "JJ", "NN", "RB", "VBZ"))

# Key differences
class(pos_chars)       # "character"
[1] "character"
Code
class(pos_factor)      # "factor"
[1] "factor"
Code
levels(pos_factor)     # the defined level set, in order
[1] "DT"  "JJ"  "NN"  "RB"  "VBZ"
Code
nlevels(pos_factor)    # number of levels
[1] 5
Code
# A factor remembers ALL levels even if some are absent in the data
absent_level <- factor(c("A", "B"), levels = c("A", "B", "C"))
table(absent_level)    # C appears with count 0
absent_level
A B C 
1 1 0 

The forcats Package

forcats (loaded as part of the tidyverse) provides a coherent set of functions for working with factors. All function names begin with fct_.

Reordering Levels

Code
# Sample annotation data
anno_df <- data.frame(
  token = c("the", "corpus", "contains", "very", "interesting", "data",
            "the", "speaker", "spoke", "quite", "quickly", "today"),
  upos  = c("DT", "NN", "VBZ", "RB", "JJ", "NN",
             "DT", "NN", "VBD", "RB", "RB", "NN"),
  stringsAsFactors = FALSE
)

# Without forcats: alphabetical order in plot (rarely what we want)
ggplot(anno_df, aes(x = upos)) +
  geom_bar(fill = "steelblue") +
  theme_bw() +
  labs(title = "POS distribution (alphabetical — default)")

Code
# fct_infreq: order by descending frequency
anno_df |>
  dplyr::mutate(upos = forcats::fct_infreq(upos)) |>
  ggplot(aes(x = upos)) +
  geom_bar(fill = "steelblue") +
  theme_bw() +
  labs(title = "POS distribution (ordered by frequency)")

Code
# fct_rev: reverse current level order
anno_df |>
  dplyr::mutate(upos = forcats::fct_rev(forcats::fct_infreq(upos))) |>
  ggplot(aes(x = upos)) +
  geom_bar(fill = "steelblue") +
  coord_flip() +
  theme_bw() +
  labs(title = "POS distribution (frequency order, horizontal)")

Code
# fct_reorder: order a factor by a summary statistic of another variable
set.seed(123)  # make the simulated RTs reproducible
rt_df <- data.frame(
  condition   = rep(c("Primed", "Unprimed", "Filler"), each = 40),
  rt          = c(rnorm(40, 580, 60), rnorm(40, 650, 70), rnorm(40, 700, 80))
)

# Reorder condition levels by their median RT before plotting
rt_df |>
  dplyr::mutate(condition = forcats::fct_reorder(condition, rt, .fun = median)) |>
  ggplot(aes(x = condition, y = rt, fill = condition)) +
  geom_boxplot(show.legend = FALSE) +
  theme_bw() +
  labs(title = "RT by condition (ordered by median RT)",
       x = "Condition", y = "Reaction time (ms)")

Relabelling Levels

Code
# fct_recode: rename individual levels
pos_factor_labelled <- forcats::fct_recode(
  factor(anno_df$upos),
  "Determiner"  = "DT",
  "Adjective"   = "JJ",
  "Noun"        = "NN",
  "Adverb"      = "RB",
  "Verb (past)" = "VBD",
  "Verb (pres)" = "VBZ"
)
levels(pos_factor_labelled)
[1] "Determiner"  "Adjective"   "Noun"        "Adverb"      "Verb (past)"
[6] "Verb (pres)"
Code
table(pos_factor_labelled)
pos_factor_labelled
 Determiner   Adjective        Noun      Adverb Verb (past) Verb (pres) 
          2           1           4           3           1           1 
Code
# fct_relabel: apply a function to ALL level names at once
pos_lower <- forcats::fct_relabel(factor(anno_df$upos), tolower)
levels(pos_lower)
[1] "dt"  "jj"  "nn"  "rb"  "vbd" "vbz"

Collapsing and Lumping Levels

When a factor has many levels, it is often useful to collapse rare or related levels into a single catch-all category.

Code
# Simulate a larger POS-tagged corpus
set.seed(42)
all_pos <- sample(
  c("NN", "VBZ", "DT", "JJ", "RB", "IN", "PRP", "VBD", "NNS", "VBP",
    "CC", "MD", "WP", "EX", "UH"),
  size    = 200,
  replace = TRUE,
  prob    = c(0.20, 0.12, 0.11, 0.09, 0.08, 0.07, 0.06, 0.06,
              0.05, 0.04, 0.04, 0.03, 0.02, 0.02, 0.01)
)

pos_factor_full <- factor(all_pos)
nlevels(pos_factor_full)  # only tags present in the data become levels (here 14 of the 15 sampled)
[1] 14
Code
# fct_lump_n: keep the n most frequent levels, collapse the rest to "Other"
pos_lumped_5 <- forcats::fct_lump_n(pos_factor_full, n = 5)
table(pos_lumped_5)
pos_lumped_5
   DT    JJ    NN   VBD   VBZ Other 
   19    16    41    22    18    84 
Code
# fct_lump_prop: keep levels accounting for > prop of observations
pos_lumped_prop <- forcats::fct_lump_prop(pos_factor_full, prop = 0.05)
table(pos_lumped_prop)
pos_lumped_prop
   DT    IN    JJ    NN   NNS   PRP    RB   VBD   VBZ Other 
   19    15    16    41    12    13    14    22    18    30 
Code
# fct_other: manually specify which levels to keep (all others → "Other")
pos_content <- forcats::fct_other(
  pos_factor_full,
  keep = c("NN", "NNS", "VBZ", "VBD", "VBP", "JJ"),
  other_level = "Function"
)
table(pos_content)
pos_content
      JJ       NN      NNS      VBD      VBP      VBZ Function 
      16       41       12       22        9       18       82 

Adding and Dropping Levels

Code
# fct_drop: remove levels that have no observations
all_genres <- factor(c("academic", "fiction", "news"),
                     levels = c("academic", "fiction", "news", "spoken", "web"))
nlevels(all_genres)           # 5 levels
[1] 5
Code
nlevels(forcats::fct_drop(all_genres))  # 3 levels
[1] 3
Code
# fct_expand: add new levels (useful before rbind-ing data frames)
expanded <- forcats::fct_expand(all_genres, "social_media", "blog")
levels(expanded)
[1] "academic"     "fiction"      "news"         "spoken"       "web"         
[6] "social_media" "blog"        
Code
# fct_na_value_to_level: treat NA as an explicit factor level
with_na  <- factor(c("academic", NA, "fiction", NA, "news"))
with_na_level <- forcats::fct_na_value_to_level(with_na, level = "Unknown")
table(with_na_level, useNA = "always")
with_na_level
academic  fiction     news  Unknown     <NA> 
       1        1        1        2        0 

A researcher has a factor genre with levels in alphabetical order: "academic", "fiction", "news", "spoken". She wants to reorder the bars in a ggplot2 bar chart so that the most frequent genre appears first. Which forcats function should she use?

  1. fct_reorder(genre, genre) — reorder by alphabetical value
  2. fct_infreq(genre) — reorder levels by descending frequency of observations
  3. fct_rev(genre) — reverse the current alphabetical order
  4. fct_recode(genre) — rename the level labels
Answer

b) fct_infreq(genre) — reorder levels by descending frequency of observations

fct_infreq() reorders factor levels so that the most frequently occurring level comes first, which is exactly what places it as the first bar in a bar chart. fct_reorder() (option a) reorders by a summary statistic of another variable (e.g. median RT), not by the factor’s own frequency. fct_rev() only reverses the existing order without considering frequency. fct_recode() changes level names, not order.


Unicode, Encoding, and Non-ASCII Characters

Section Overview

What you will learn: What text encoding is and why it matters for linguistic data; how to detect and fix encoding problems; how to work with IPA symbols, non-Latin scripts, and Unicode special characters in R; and locale-aware case conversion for non-English languages.

What Is Text Encoding?

A character encoding maps each character to a sequence of bytes. The most important encodings for linguistic research are:

Common text encodings:

  • UTF-8: all Unicode characters (~150,000); modern files and web data; the recommended default
  • Latin-1 / ISO-8859-1: Western European languages; older files, Windows legacy
  • Windows-1252 (CP1252): Western European plus smart quotes; files created on Windows
  • UTF-16: all of Unicode (2 or 4 bytes per character); some Windows apps, older XML
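The byte-level differences between these encodings can be inspected with charToRaw(). A minimal sketch using the single character "é" (U+00E9):

```r
# "é" is one code point, but its byte representation depends on the encoding
e_utf8   <- "\u00e9"
e_latin1 <- iconv(e_utf8, from = "UTF-8", to = "latin1")

charToRaw(e_utf8)     # two bytes in UTF-8: c3 a9
charToRaw(e_latin1)   # one byte in Latin-1: e9
nchar(e_utf8)         # still a single character: 1
```

Mis-declaring the encoding therefore means reading the right bytes as the wrong characters, which is exactly how "é" turns into mojibake like "Ã©".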
Always Use UTF-8

Save all R scripts and data files in UTF-8. In RStudio: File → Save with Encoding → UTF-8. Set your default in Tools → Global Options → Code → Saving → Default text encoding: UTF-8. Nearly all encoding headaches arise from mixing UTF-8 and Latin-1 files.
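When reading or writing files from code, the encoding can also be fixed explicitly rather than left to the system locale. A minimal sketch using a temporary file (the path is created here purely for illustration):

```r
path <- tempfile(fileext = ".txt")

# write the file as UTF-8 regardless of the platform default
con <- file(path, open = "w", encoding = "UTF-8")
writeLines("café résumé naïve", con)
close(con)

# declare the encoding when reading the file back in
readLines(path, encoding = "UTF-8")
```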

Detecting and Converting Encodings

Code
# str_conv: convert encoding
latin1_text <- iconv("café résumé naïve", to = "latin1")
utf8_text   <- stringr::str_conv(latin1_text, encoding = "latin1")
utf8_text
[1] "café résumé naïve"
Code
# iconv: lower-level conversion with error handling
# sub = "byte": replace invalid bytes with their hex code (never fails)
# sub = NA:     return NA for strings with invalid bytes (for detection)
mixed <- c("valid UTF-8", iconv("caf\xe9", from = "latin1", to = "UTF-8"))
iconv(mixed, from = "UTF-8", to = "UTF-8", sub = NA)
[1] "valid UTF-8" "café"       
Code
# Detect encoding of an unknown file (requires stringi)
# stringi::stri_enc_detect(readBin("unknown_file.txt", "raw", 10000))

IPA and Phonetic Symbols

IPA symbols are ordinary Unicode code points and are fully supported in R via UTF-8:

Code
# IPA transcriptions
ipa <- c(
  "linguistics"  = "/lɪŋˈɡwɪstɪks/",
  "phonology"    = "/fəˈnɒlədʒi/",
  "morphology"   = "/mɔːˈfɒlədʒi/",
  "syntax"       = "/ˈsɪntæks/",
  "semantics"    = "/sɪˈmæntɪks/"
)

nchar(ipa)                              # character count per transcription
linguistics   phonology  morphology      syntax   semantics 
         14          12          13          10          12 
Code
str_detect(ipa, "ɪ")                   # detect the IPA vowel ɪ (LATIN LETTER SMALL CAPITAL I)
[1]  TRUE FALSE FALSE  TRUE  TRUE
Code
str_extract_all(ipa, "[ˈˌ][^ˈˌ/]+")   # extract stressed syllables
[[1]]
[1] "ˈɡwɪstɪks"

[[2]]
[1] "ˈnɒlədʒi"

[[3]]
[1] "ˈfɒlədʒi"

[[4]]
[1] "ˈsɪntæks"

[[5]]
[1] "ˈmæntɪks"
Code
# Remove stress marks and syllable boundaries
str_remove_all(ipa, "[ˈˌ.\\-]")
[1] "/lɪŋɡwɪstɪks/" "/fənɒlədʒi/"   "/mɔːfɒlədʒi/"  "/sɪntæks/"    
[5] "/sɪmæntɪks/"  
Code
# Extract only vowels (broad IPA vowel symbols)
vowels_ipa <- "[aeiouæɑɒɔəɛɜɪʊʌ]"
str_extract_all(ipa, vowels_ipa) |>
  purrr::map(~ paste(.x, collapse = "")) |>
  unlist()
[1] "ɪɪɪ"  "əɒəi" "ɔɒəi" "ɪæ"   "ɪæɪ" 

Non-Latin Scripts

Code
# R handles any Unicode script natively
arabic   <- "اللغويات"          # Arabic: "linguistics"
chinese  <- "语言学"             # Mandarin: "linguistics"
japanese <- "言語学"             # Japanese: "linguistics"
greek    <- "γλωσσολογία"        # Greek: "glōssología"
russian  <- "лингвистика"        # Russian: "lingvistika"
hindi    <- "भाषाविज्ञान"       # Hindi: "bhāṣāvijñāna"

scripts  <- c(arabic, chinese, japanese, greek, russian, hindi)
nchar(scripts)                   # character count (code points)
[1]  8  3  3 11 11 11
Code
# str_length is stringr's equivalent of nchar (counts code points)
str_length(scripts)
[1]  8  3  3 11 11 11
Code
# Detect Cyrillic characters
str_detect(scripts, "\\p{Script=Cyrillic}")
[1] FALSE FALSE FALSE FALSE  TRUE FALSE
Code
# Detect CJK characters (Chinese/Japanese/Korean)
str_detect(scripts, "\\p{Script=Han}")
[1] FALSE  TRUE  TRUE FALSE FALSE FALSE
Unicode Script Properties in Regex

stringr uses the ICU regex engine (via the stringi package), which supports Unicode property escapes of the form \p{Property=Value}. Useful ones for linguists:

Unicode property escapes
Pattern Matches
\p{L} Any Unicode letter
\p{Lu} Uppercase letter
\p{Ll} Lowercase letter
\p{N} Any numeric character
\p{Script=Latin} Latin-script characters
\p{Script=Arabic} Arabic-script characters
\p{Script=Han} CJK characters
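A brief sketch of these escapes in action (assuming stringr is loaded, as elsewhere in this tutorial):

```r
library(stringr)

mixed <- "Sample 3: Überprüfung, тест, 测试"

str_extract_all(mixed, "\\p{L}+")[[1]]       # letter runs, regardless of script
str_count(mixed, "\\p{N}")                   # numeric characters: 1
str_extract(mixed, "\\p{Script=Cyrillic}+")  # only the Cyrillic word: "тест"
```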

Locale-Aware Case Conversion

Code
# Turkish has dotted/dotless i — standard tolower/toupper fails
str_to_upper("istanbul", locale = "tr")   # İSTANBUL (correct for Turkish)
[1] "İSTANBUL"
Code
str_to_upper("istanbul", locale = "en")   # ISTANBUL (English behaviour)
[1] "ISTANBUL"
Code
str_to_lower("İSTANBUL", locale = "tr")   # istanbul
[1] "istanbul"
Code
str_to_lower("İSTANBUL", locale = "en")   # i̇stanbul (wrong for Turkish)
[1] "i̇stanbul"
Code
# German sharp s
str_to_upper("straße", locale = "de")     # STRASSE (ß → SS in uppercase)
[1] "STRASSE"
Code
# str_to_title: capitalise first letter of each word
str_to_title("the quick brown fox", locale = "en")
[1] "The Quick Brown Fox"

You are processing a corpus of files downloaded from an older German website. After reading the files with readLines(), some strings contain the raw bytes \xfc (ü), \xe4 (ä), and \xf6 (ö), which display as garbled characters. What is the most likely cause, and what is the correct fix?

  1. The files are corrupted — re-download them
  2. The files are encoded in Latin-1 (or Windows-1252), not UTF-8. Use readLines(f, encoding = "latin1") or iconv(text, from = "latin1", to = "UTF-8")
  3. R does not support German characters — use Python instead
  4. Use str_squish() to clean the garbled bytes
Answer

b) The files are encoded in Latin-1 (or Windows-1252), not UTF-8

The byte values \xfc, \xe4, and \xf6 are the Latin-1 encodings of ü, ä, and ö — common German characters. When R reads a file assuming UTF-8 but the file is Latin-1, these multi-byte characters appear garbled. The fix is to read with the correct encoding: readLines(f, encoding = "latin1"), or convert afterwards with iconv(text, from = "latin1", to = "UTF-8"). Option (d) is wrong — str_squish() handles whitespace only and has no effect on byte values.


Regular Expressions

Section Overview

What you will learn: How to write regex patterns using character classes, quantifiers, anchors, alternation, groups, named capture groups, and lookahead/lookbehind — with linguistic examples throughout. The focus is on patterns that arise in real linguistic data processing.

Special Characters and Escaping

Most characters match themselves literally. The following have special meaning and must be escaped with \\ in R strings:

. * + ? ^ $ ( ) [ ] { } | \
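A common stumbling block is that R's string parser consumes one layer of backslashes before the regex engine ever sees the pattern. writeLines() prints a string exactly as the engine receives it:

```r
# "\\." in R source is a 2-character string: a backslash and a dot
writeLines("end\\.")        # prints end\.  — the regex for a literal full stop
nchar("end\\.")             # 5 characters: e, n, d, \, .
```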

Code
# Match a literal full stop (. means "any character" in regex)
str_detect(c("end.", "end!"), "end\\.")   # only "end." matches
[1]  TRUE FALSE
Code
# Match a literal parenthesis
str_extract("Syntax (Chomsky 1957)", "\\([^)]+\\)")
[1] "(Chomsky 1957)"

Character Classes

Code
str_extract_all("linguistics", "[aeiou]")[[1]]          # vowels only
[1] "i" "u" "i" "i"
Code
str_extract_all("Word1 word2", "[A-Za-z]+")[[1]]        # letter sequences
[1] "Word" "word"
Code
str_extract_all("Score: 4/5", "[^A-Za-z: /]")[[1]]     # negated class
[1] "4" "5"
Code
# Shorthand classes (Unicode-aware in stringr's ICU engine)
# \\d = digits (like [0-9], plus other scripts' digits)   \\D = non-digit
# \\w = word characters (letters, digits, _)              \\W = non-word
# \\s = whitespace                                        \\S = non-whitespace
# \\b = word boundary (zero-width)

str_extract_all("Call 0412 345 678", "\\d+")[[1]]
[1] "0412" "345"  "678" 
Code
str_extract_all("one two three", "\\b\\w+\\b")[[1]]
[1] "one"   "two"   "three"

Quantifiers

Regex quantifiers
Quantifier Meaning Example
? 0 or 1 colou?r → colour, color
* 0 or more \\d* → zero or more digits
+ 1 or more \\d+ → one or more digits
{n} Exactly n \\w{4} → four consecutive word characters
{n,m} Between n and m \\d{2,4} → 2–4 digits
*? +? Lazy (minimal) Match as little as possible
Code
verbs <- c("walk", "walks", "walking", "walked", "runner")

str_subset(verbs, "\\w+ing$")           # -ing forms
[1] "walking"
Code
str_subset(verbs, "\\w+ed$")            # -ed forms
[1] "walked"
Code
str_subset(verbs, "^\\w{4}$")           # exactly 4 characters
[1] "walk"
Code
str_detect(c("colour", "color"), "colou?r")  # optional u
[1] TRUE TRUE
Code
# Greedy vs. lazy
quoted <- 'She said "very" and he said "quite good"'
str_extract(quoted, '".*"')             # greedy: first to last "
[1] "\"very\" and he said \"quite good\""
Code
str_extract(quoted, '".*?"')            # lazy:  first to next "
[1] "\"very\""

Anchors and Word Boundaries

Code
lines <- c("Grammar is structural.", "The grammar of English.", "grammar matters.")

str_subset(lines, "^[A-Z]")            # starts with capital letter
[1] "Grammar is structural."  "The grammar of English."
Code
str_subset(lines, "\\.$")              # ends with full stop
[1] "Grammar is structural."  "The grammar of English."
[3] "grammar matters."       
Code
# Word boundaries prevent partial matches
str_count(exampletext, "the")          # matches "the", "other", "there"...
[1] 6
Code
str_count(exampletext, "\\bthe\\b")    # only the exact word "the"
[1] 5

Alternation and Groups

Code
# Alternation: | inside ()
str_subset(
  c("very nice", "quite good", "so interesting", "fairly common"),
  "\\b(very|quite|so|fairly)\\b"
)
[1] "very nice"      "quite good"     "so interesting" "fairly common" 
Code
# Grouping for quantifiers
str_detect(c("haha", "hahaha", "ha", "hahahahaha"), "(ha){2,}")
[1]  TRUE  TRUE FALSE  TRUE
Code
# Back-references: \\1 matches what group 1 captured
redupl <- c("so so tired", "very very slowly", "quite good")
str_detect(redupl, "\\b(\\w+) \\1\\b")   # reduplicated word
[1]  TRUE  TRUE FALSE
Code
str_match(redupl, "\\b(\\w+) \\1\\b")[, 2]  # extract the word
[1] "so"   "very" NA    
Code
# Match colour/color — without \\b anchors, "colouring" also matches
str_detect(c("colour", "color", "colouring"), "colou?r")
[1] TRUE TRUE TRUE

Named Capture Groups

Named capture groups, written (?<name>...), make complex extraction readable and robust. The group’s value can be accessed by name from the result matrix, which is safer than relying on column position.

Code
# Extract structured information from POS-tagged text
# Format: WORD/POS/LEMMA
tagged <- c("The/DT/the", "corpus/NN/corpus", "contains/VBZ/contain",
            "very/RB/very", "interesting/JJ/interesting", "data/NN/datum")

pattern <- "(?<word>[^/]+)/(?<pos>[^/]+)/(?<lemma>[^/]+)"
m <- str_match(tagged, pattern)

anno_df <- data.frame(
  word  = m[, "word"],
  pos   = m[, "pos"],
  lemma = m[, "lemma"],
  stringsAsFactors = FALSE
)
anno_df
         word pos       lemma
1         The  DT         the
2      corpus  NN      corpus
3    contains VBZ     contain
4        very  RB        very
5 interesting  JJ interesting
6        data  NN       datum
Code
# Extract IPA transcriptions from formatted dictionary entries
dict <- c(
  "linguistics /lɪŋˈɡwɪstɪks/ noun",
  "phonology /fəˈnɒlədʒi/ noun",
  "morphology /mɔːˈfɒlədʒi/ noun",
  "syntax /ˈsɪntæks/ noun"
)
ipa_pattern <- "(?<word>\\w+) /(?<ipa>[^/]+)/ (?<pos>\\w+)"
ipa_m       <- str_match(dict, ipa_pattern)

data.frame(
  word = ipa_m[, "word"],
  ipa  = ipa_m[, "ipa"],
  pos  = ipa_m[, "pos"],
  stringsAsFactors = FALSE
)
         word          ipa  pos
1 linguistics lɪŋˈɡwɪstɪks noun
2   phonology   fəˈnɒlədʒi noun
3  morphology  mɔːˈfɒlədʒi noun
4      syntax     ˈsɪntæks noun
Code
# Named groups with str_match_all for multiple matches per string
# Extract all citation references: Author (Year) format
text_with_cites <- paste(
  "As Chomsky (1957) argued, and later confirmed by Labov (1972),",
  "sociolinguistic variation (Trudgill 1974; Milroy 1980) is systematic."
)

cite_pattern <- "(?<author>[A-Z][a-z]+)\\s+\\((?<year>\\d{4})\\)"
cite_matches <- str_match_all(text_with_cites, cite_pattern)[[1]]

data.frame(
  author = cite_matches[, "author"],
  year   = as.integer(cite_matches[, "year"]),
  stringsAsFactors = FALSE
)
   author year
1 Chomsky 1957
2   Labov 1972

Lookahead and Lookbehind

Lookaround assertions match a position relative to a pattern without including the pattern itself in the match result.

Lookaround syntax
Assertion Syntax Meaning
Positive lookahead (?=...) Position followed by …
Negative lookahead (?!...) Position NOT followed by …
Positive lookbehind (?<=...) Position preceded by …
Negative lookbehind (?<!...) Position NOT preceded by …
Code
# Words immediately preceding "grammar"
str_extract_all(exampletext, "\\w+(?=\\s+grammar)")[[1]]
[1] "of"
Code
# Words immediately following "the"
str_extract_all(exampletext, "(?<=\\bthe\\s)\\w+")[[1]]
[1] "production"   "organisation" "formation"    "formation"    "principles"  
Code
# Amplified adjectives: adjectives following "very" or "quite"
amp_sent <- "The very beautiful garden and the quite interesting lecture."
str_extract_all(amp_sent, "(?<=very |quite )\\w+")[[1]]
[1] "beautiful"   "interesting"
Code
# Split on sentence boundaries WITHOUT consuming the punctuation
# (?<=[.!?]) = preceded by sentence-final punctuation
sentences_split <- str_split(exampletext, "(?<=[.!?])\\s+")[[1]]
sentences_split
[1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
[2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         

Practical Regex for Linguistic Data

Code
# 1. Extract all -ing forms
str_extract_all(exampletext, "\\b\\w+ing\\b")[[1]]
[1] "meaning"    "pertaining"
Code
# 2. Remove XML/HTML tags (common in corpus data)
tagged_text <- "<p>The <hi rend=\"italic\">corpus</hi> contains <b>data</b>.</p>"
str_remove_all(tagged_text, "<[^>]+>")
[1] "The corpus contains data."
Code
# 3. Extract quoted speech
narrative <- 'She said "I will return" and he replied "Good luck".'
str_extract_all(narrative, '"([^"]+)"')[[1]]
[1] "\"I will return\"" "\"Good luck\""    
Code
# 4. Extract year references from academic text
academic <- "Chomsky (1957), Labov (1972), and Trudgill (1974) all contributed."
str_extract_all(academic, "\\d{4}")[[1]]
[1] "1957" "1972" "1974"
Code
# 5. Detect passive constructions (rough heuristic)
passive_pat <- "\\b(is|are|was|were|been)\\s+\\w+ed\\b"
str_detect(splitexampletext, passive_pat)
[1] FALSE FALSE  TRUE
Code
# 6. Anonymise emails
emails_text <- "Contact martin@ladal.edu.au or admin@university.org for details."
str_replace_all(emails_text,
                "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
                "[EMAIL REDACTED]")
[1] "Contact [EMAIL REDACTED] or [EMAIL REDACTED] for details."
Your turn!

Q6 Which regex correctly matches whole words ending in -tion or -sion (e.g. intention, tension)?





Q7 You want to extract the word immediately after “very” in a text, without including “very” in the result. Which regex feature achieves this?






Text Cleaning Pipelines

Section Overview

What you will learn: How to combine multiple string operations into a single reusable cleaning function; common preprocessing steps for corpus linguistics; a tm-based pipeline and a stringr-based alternative; and how to apply either to a full directory of texts

Why Build a Pipeline?

Text cleaning for corpus analysis chains many steps — lowercasing, removing markup, stripping punctuation, removing numbers, eliminating stopwords, collapsing whitespace — and you need to apply the exact same sequence to every text. Encoding the pipeline as a function ensures reproducibility, transparency, and reusability.

When NOT to Remove Stopwords

Stopword removal is appropriate for topic modelling and keyword extraction. But it is inappropriate for grammatical analysis (function words are the data), discourse analysis (markers like well, so, I mean are usually stopwords but often exactly what you want), and sentiment analysis (negation words like not, never are on stopword lists but reverse polarity). Always check whether the words you remove are relevant to your research question.
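To make the fix concrete, here is a minimal base-R sketch (using a small toy stopword list, not tm's full one) that removes stopwords while retaining negation words:

```r
stops     <- c("the", "a", "of", "is", "were", "not", "no", "never")  # toy list
negation  <- c("not", "no", "never")
stops_use <- setdiff(stops, negation)          # stopwords minus negation words

pat <- paste0("\\b(", paste(stops_use, collapse = "|"), ")\\b")
txt <- "the results were not interesting"
cleaned <- trimws(gsub("\\s+", " ", gsub(pat, "", txt)))
cleaned                                        # negation survives:
# [1] "results not interesting"
```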

The tm Building Blocks

Code
raw <- paste(
  "The study of Grammar (including <b>Syntax</b>, Morphology, and Phonology) is central",
  "to Linguistics. There are 3 main branches — explored by linguists since the 19th century."
)

tm::removeNumbers(raw)
[1] "The study of Grammar (including <b>Syntax</b>, Morphology, and Phonology) is central to Linguistics. There are  main branches — explored by linguists since the th century."
Code
tm::removePunctuation(raw)
[1] "The study of Grammar including bSyntaxb Morphology and Phonology is central to Linguistics There are 3 main branches — explored by linguists since the 19th century"
Code
tm::removeWords(raw, tm::stopwords("english"))
[1] "The study  Grammar (including <b>Syntax</b>, Morphology,  Phonology)  central  Linguistics. There  3 main branches — explored  linguists since  19th century."
Code
tm::stripWhitespace(raw)
[1] "The study of Grammar (including <b>Syntax</b>, Morphology, and Phonology) is central to Linguistics. There are 3 main branches — explored by linguists since the 19th century."
Code
tm::stemDocument(raw, language = "en")
[1] "The studi of Grammar (includ <b>Syntax</b>, Morphology, and Phonology) is central to Linguistics. There are 3 main branch — explor by linguist sinc the 19th century."

A Reusable tm-Based Pipeline

Code
clean_text_tm <- function(text,
                           lowercase     = TRUE,
                           rm_markup     = TRUE,
                           rm_punct      = TRUE,
                           rm_numbers    = TRUE,
                           rm_stopwords  = TRUE,
                           stopword_lang = "english",
                           stem          = FALSE,
                           squish_ws     = TRUE) {
  out <- text
  if (rm_markup)    out <- stringr::str_remove_all(out, "<[^>]+>")
  if (lowercase)    out <- tolower(out)
  if (rm_punct)     out <- tm::removePunctuation(out)
  if (rm_numbers)   out <- tm::removeNumbers(out)
  if (rm_stopwords) out <- tm::removeWords(out, tm::stopwords(stopword_lang))
  if (stem)         out <- tm::stemDocument(out, language = stopword_lang)
  if (squish_ws)    out <- tm::stripWhitespace(out)
  stringr::str_trim(out)
}

clean_text_tm(raw)
[1] "study grammar including syntax morphology phonology central linguistics main branches — explored linguists since th century"
Code
clean_text_tm(raw, rm_stopwords = FALSE) |> substr(1, 80)
[1] "the study of grammar including syntax morphology and phonology is central to lin"
Code
clean_text_tm(raw, stem = TRUE) |> substr(1, 80)
[1] "studi grammar includ syntax morpholog phonolog central linguist main branch — ex"

A stringr-Based Pipeline

The stringr alternative gives more control over punctuation rules and handles Unicode better:

Code
clean_text_stringr <- function(text,
                                lowercase     = TRUE,
                                rm_markup     = TRUE,
                                rm_punct      = TRUE,
                                rm_numbers    = TRUE,
                                rm_stopwords  = TRUE,
                                keep_hyphens  = TRUE,
                                squish_ws     = TRUE) {
  out <- text

  # 1. Remove XML/HTML markup
  if (rm_markup)  out <- str_remove_all(out, "<[^>]+>")

  # 2. Lowercase
  if (lowercase)  out <- str_to_lower(out)

  # 3. Remove punctuation (optionally keep internal hyphens)
  if (rm_punct) {
    if (keep_hyphens) {
      out <- str_remove_all(out, "[^\\w\\s\\-]")   # keep - inside words
    } else {
      out <- str_remove_all(out, "[^\\w\\s]")
    }
  }

  # 4. Remove numbers
  if (rm_numbers) out <- str_remove_all(out, "\\d+")

  # 5. Remove stopwords with word-boundary matching
  if (rm_stopwords) {
    stops   <- tm::stopwords("english")
    pattern <- str_c("\\b(", str_c(stops, collapse = "|"), ")\\b")
    out     <- str_remove_all(out, pattern)
  }

  # 6. Collapse whitespace
  if (squish_ws) out <- str_squish(out)

  out
}

clean_text_stringr(raw)
[1] "study grammar including syntax morphology phonology central linguistics main branches explored linguists since th century"
Code
# Demonstrate keep_hyphens option
hyphen_text <- "Well-known socio-linguistic phenomena include code-switching."
clean_text_stringr(hyphen_text, rm_stopwords = FALSE, keep_hyphens = TRUE)
[1] "well-known socio-linguistic phenomena include code-switching"
Code
clean_text_stringr(hyphen_text, rm_stopwords = FALSE, keep_hyphens = FALSE)
[1] "wellknown sociolinguistic phenomena include codeswitching"

Applying a Pipeline to a Corpus

Code
# Simulate a small corpus (in practice: read from files)
corpus_raw <- c(
  T01 = "The <b>grammar</b> of English has changed since the 1800s.",
  T02 = "Syntax deals with sentence structure — 3 main frameworks exist.",
  T03 = "Morphology examines word formation and the structure of words.",
  T04 = "Phonology studies the sound systems of languages (44 phonemes in English)."
)

# Apply pipeline to all texts
corpus_clean <- purrr::map_chr(corpus_raw, clean_text_stringr)

# Display before/after
data.frame(
  id     = names(corpus_raw),
  before = str_trunc(corpus_raw,   60),
  after  = str_trunc(corpus_clean, 60)
) |>
  flextable() |>
  flextable::set_table_properties(width = 1, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption("Corpus texts before and after cleaning pipeline")

Corpus texts before and after cleaning pipeline
id before after
T01 The <b>grammar</b> of English has changed since the 1800s. grammar english changed since s
T02 Syntax deals with sentence structure — 3 main frameworks ... syntax deals sentence structure main frameworks exist
T03 Morphology examines word formation and the structure of w... morphology examines word formation structure words
T04 Phonology studies the sound systems of languages (44 phon... phonology studies sound systems languages phonemes english

A researcher applies the pipeline lowercase → removePunctuation → removeStopwords → stripWhitespace to her corpus. She later finds that “not interesting” has become just “interesting” throughout, reversing the intended meaning of many sentences. Which step caused this and how should she fix it?

  1. lowercase — preserving capitalisation would have prevented this
  2. removePunctuation — punctuation carries semantic information
  3. removeStopwords — “not” is on the English stopword list; she should use a custom stopword list that excludes negation words, or skip stopword removal entirely for this analysis
  4. stripWhitespace — collapsing spaces altered the word sequence
Answer

c) removeStopwords

English stopword lists include negation words like not, never, no, nor, neither. Removing them from text that will be analysed for meaning or sentiment is a serious error because these words reverse the polarity of surrounding words. The fix: create a custom stopword list that excludes all negation words, or skip stopword removal and rely on your analysis method to handle function words appropriately.


Tokenisation with quanteda

Section Overview

What you will learn: What tokenisation is; the difference between word, sentence, and character tokenisation; how to use quanteda’s tokens() function with various options; and how to inspect, filter, and work with the resulting token objects

What Is Tokenisation?

Tokenisation is the process of splitting a text into a sequence of discrete units called tokens. A token is typically a word, but it can also be a sentence, character, n-gram, or any other unit depending on your analytical goal.

Tokenisation options in quanteda
Unit Function Returns Typical use
Sentence quanteda::tokenize_sentence() List of sentence strings Sentence-level analysis, KWIC
Word quanteda::tokens(what = "word") tokens object Frequency analysis, collocations
Character quanteda::tokens(what = "character") tokens object Character n-grams, orthographic analysis
N-gram quanteda::tokens_ngrams() tokens object Collocation, language models
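Before turning to quanteda, the effect of the unit choice can be illustrated with base R alone (a rough whitespace heuristic, not quanteda's tokeniser):

```r
txt <- "Tokenisation splits text into units."

words <- strsplit(txt, "\\s+")[[1]]               # word tokens (whitespace split)
chars <- strsplit(gsub(" ", "", txt), "")[[1]]    # character tokens (spaces dropped)

length(words)   # 5
length(chars)   # 32
```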

Sentence Tokenisation

Code
# Split text into sentences
et_sentences <- quanteda::tokenize_sentence(exampletext) |> unlist()
et_sentences
[1] "Grammar is a system of rules which governs the production and use of utterances in a given language."                                                                                                                                                                                                   
[2] "These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences)."
[3] "Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."                                                                                                                                                                         
Code
# Works on a vector of texts too
multi_sent <- quanteda::tokenize_sentence(
  c(exampletext, additionaltext)
)
lengths(multi_sent)   # how many sentences per text?
[1] 3 4

Word Tokenisation

Code
# Build a quanteda corpus first
corp <- quanteda::corpus(
  c(exampletext, additionaltext),
  docnames = c("grammar", "saussure")
)

# Default word tokenisation (preserves punctuation)
toks_default <- quanteda::tokens(corp, what = "word")
head(as.character(toks_default[[1]]), 20)
 [1] "Grammar"    "is"         "a"          "system"     "of"        
 [6] "rules"      "which"      "governs"    "the"        "production"
[11] "and"        "use"        "of"         "utterances" "in"        
[16] "a"          "given"      "language"   "."          "These"     
Code
# Clean tokenisation: remove punctuation, symbols, numbers, URLs
toks_clean <- quanteda::tokens(
  corp,
  what           = "word",
  remove_punct   = TRUE,
  remove_symbols = TRUE,
  remove_numbers = FALSE,
  remove_url     = TRUE,
  split_hyphens  = FALSE   # keep "well-known" as one token
)
head(as.character(toks_clean[[1]]), 20)
 [1] "Grammar"    "is"         "a"          "system"     "of"        
 [6] "rules"      "which"      "governs"    "the"        "production"
[11] "and"        "use"        "of"         "utterances" "in"        
[16] "a"          "given"      "language"   "These"      "rules"     
Code
# Token counts
lengths(toks_clean)
 grammar saussure 
      81      111 

Removing Stopwords in quanteda

Code
# quanteda has built-in stopword lists
head(quanteda::stopwords("en"), 20)
 [1] "i"          "me"         "my"         "myself"     "we"        
 [6] "our"        "ours"       "ourselves"  "you"        "your"      
[11] "yours"      "yourself"   "yourselves" "he"         "him"       
[16] "his"        "himself"    "she"        "her"        "hers"      
Code
# Remove stopwords from tokens object
toks_nostop <- quanteda::tokens_remove(
  toks_clean,
  pattern = quanteda::stopwords("en"),
  padding = FALSE   # TRUE replaces removed tokens with "" (preserves positions)
)

head(as.character(toks_nostop[[1]]), 20)
 [1] "Grammar"      "system"       "rules"        "governs"      "production"  
 [6] "use"          "utterances"   "given"        "language"     "rules"       
[11] "apply"        "sound"        "well"         "meaning"      "include"     
[16] "componential" "subsets"      "rules"        "pertaining"   "phonology"   
Code
# Compare token counts before/after stopword removal
data.frame(
  text     = names(toks_clean),
  with_sw  = lengths(toks_clean),
  without_sw = lengths(toks_nostop)
) |>
  dplyr::mutate(pct_removed = round(100 * (1 - without_sw / with_sw), 1))
             text with_sw without_sw pct_removed
grammar   grammar      81         45        44.4
saussure saussure     111         64        42.3

Selecting and Filtering Tokens

Code
# Keep only tokens matching a pattern
toks_nouns <- quanteda::tokens_select(
  toks_clean,
  pattern   = c("grammar", "syntax", "morphology", "phonology",
                "language", "linguistic*"),   # * is a glob wildcard
  valuetype = "glob"
)
as.character(toks_nouns[[1]])
[1] "Grammar"     "language"    "phonology"   "morphology"  "syntax"     
[6] "grammar"     "linguistics"
Code
# tokens_select with regex (unanchored: it also matches "ling" inside "linguistics")
toks_ing <- quanteda::tokens_select(
  toks_clean,
  pattern   = "\\w+ing",
  valuetype = "regex"
)
as.character(toks_ing[[1]])
[1] "meaning"     "pertaining"  "linguistics"

N-Grams

N-grams are consecutive sequences of n tokens. Bigrams (n=2) and trigrams (n=3) are especially useful for collocation analysis and language modelling.

Code
# Extract bigrams
toks_bigrams <- quanteda::tokens_ngrams(toks_nostop, n = 2)
head(as.character(toks_bigrams[[1]]), 15)
 [1] "Grammar_system"       "system_rules"         "rules_governs"       
 [4] "governs_production"   "production_use"       "use_utterances"      
 [7] "utterances_given"     "given_language"       "language_rules"      
[10] "rules_apply"          "apply_sound"          "sound_well"          
[13] "well_meaning"         "meaning_include"      "include_componential"
Code
# Skipgrams: pairs with up to k tokens skipped between them
toks_skip2 <- quanteda::tokens_ngrams(toks_nostop, n = 2, skip = 0:2)
head(as.character(toks_skip2[[1]]), 15)
 [1] "Grammar_system"        "Grammar_rules"         "Grammar_governs"      
 [4] "system_rules"          "system_governs"        "system_production"    
 [7] "rules_governs"         "rules_production"      "rules_use"            
[10] "governs_production"    "governs_use"           "governs_utterances"   
[13] "production_use"        "production_utterances" "production_given"     
Code
# Convert to a document-feature matrix for analysis
dfm_bigrams <- quanteda::dfm(toks_bigrams)
# Top features by frequency
quanteda::topfeatures(dfm_bigrams, n = 10)
         system_rules formation_composition    chomsky_competence 
                    2                     2                     2 
       grammar_system         rules_governs    governs_production 
                    1                     1                     1 
       production_use        use_utterances      utterances_given 
                    1                     1                     1 
       given_language 
                    1 

Document-Feature Matrix (DFM)

The document-feature matrix (DFM) represents a corpus as a matrix where rows are documents and columns are features (tokens). It is the standard input for most corpus-statistical analyses.
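A minimal sketch of that structure, built by hand with base R (toy documents, not the tutorial corpus):

```r
docs  <- list(d1 = c("rules", "grammar", "rules"),
              d2 = c("grammar", "syntax"))
feats <- sort(unique(unlist(docs)))          # the feature vocabulary

# rows = documents, columns = features, cells = counts per document
m <- t(sapply(docs, function(toks) table(factor(toks, levels = feats))))
m
#    grammar rules syntax
# d1       1     2      0
# d2       1     0      1
```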

Code
# Build DFM from clean tokens
dfm_clean <- quanteda::dfm(toks_clean)
dfm_clean
Document-feature matrix of: 2 documents, 111 features (42.34% sparse) and 0 docvars.
          features
docs       grammar is a system of rules which governs the production
  grammar        2  1 2      1  8     3     1       1   5          1
  saussure       1  4 1      1  5     1     1       0   6          0
[ reached max_nfeat ... 101 more features ]
Code
# Dimensions: documents × features
dim(dfm_clean)
[1]   2 111
Code
# Top features across the corpus
quanteda::topfeatures(dfm_clean, n = 15)
      of      the      and       in       is       to    rules  grammar 
      13       11       11        7        5        5        4        3 
       a language       as     that   langue   parole   system 
       3        3        3        3        3        3        2 
Code
# Weight by TF-IDF (downweights features common across all documents)
dfm_tfidf <- quanteda::dfm_tfidf(dfm_clean)
quanteda::topfeatures(dfm_tfidf, n = 10)
         as      langue      parole       sound   formation composition 
     0.9031      0.9031      0.9031      0.6021      0.6021      0.6021 
    between         his   according    specific 
     0.6021      0.6021      0.6021      0.6021 
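The weights above can be reproduced by hand. quanteda's default scheme multiplies the raw count by log10(number of documents / document frequency), so a feature occurring in every document gets weight 0. A sketch with illustrative counts consistent with the outputs above ("as": 3 occurrences, in one of the two documents; "of": present in both documents):

```r
n  <- 2                          # documents in the corpus
tf <- c(as = 3, of = 13)         # raw counts (illustrative)
df <- c(as = 1, of = 2)          # documents containing each feature

tfidf <- tf * log10(n / df)
round(tfidf, 4)
#     as     of
# 0.9031 0.0000
```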
Code
# Simple frequency plot
top15 <- quanteda::topfeatures(dfm_clean, n = 15)
data.frame(word = names(top15), freq = top15) |>
  ggplot(aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(title = "Top 15 tokens in example corpus",
       x = "Token", y = "Frequency")

Your turn!

Q9 You tokenise a text with quanteda::tokens(corp, remove_punct = TRUE) and then run tokens_remove(toks, stopwords("en"), padding = TRUE). What does padding = TRUE do?





Q10 What is a document-feature matrix (DFM), and which of the following correctly describes its structure?





Challenge!

Q11 How many word tokens does linguistics04.txt contain?

Show solution
readLines(here::here("data/testcorpus/linguistics04.txt")) |>
  paste(collapse = " ") |>
  str_split("\\s+") |>
  unlist() |>
  length()

Q12 How many individual characters does linguistics04.txt contain?

Show solution
readLines(here::here("data/testcorpus/linguistics04.txt")) |>
  paste(collapse = " ") |>
  strsplit("") |>
  unlist() |>
  length()

Citation and Session Info

Schweinberger, Martin. 2026. String Processing in R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/string/string.html (Version 2026.02.24).

@manual{schweinberger2026string,
  author       = {Schweinberger, Martin},
  title        = {String Processing in R},
  note         = {https://ladal.edu.au/tutorials/string/string.html},
  year         = {2026},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address      = {Brisbane},
  edition      = {2026.02.24}
}
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] flextable_0.9.11 here_1.0.2       checkdown_0.0.13 udpipe_0.8.11   
 [5] tm_0.7-16        NLP_0.3-2        quanteda_4.2.0   lubridate_1.9.4 
 [9] forcats_1.0.0    stringr_1.5.1    dplyr_1.2.0      purrr_1.0.4     
[13] readr_2.1.5      tidyr_1.3.2      tibble_3.2.1     ggplot2_4.0.2   
[17] tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] fastmatch_1.1-6         gtable_0.3.6            xfun_0.56              
 [4] htmlwidgets_1.6.4       lattice_0.22-6          tzdb_0.4.0             
 [7] vctrs_0.7.1             tools_4.4.2             generics_0.1.3         
[10] parallel_4.4.2          klippy_0.0.0.9500       pkgconfig_2.0.3        
[13] Matrix_1.7-2            data.table_1.17.0       RColorBrewer_1.1-3     
[16] S7_0.2.1                assertthat_0.2.1        uuid_1.2-1             
[19] lifecycle_1.0.5         compiler_4.4.2          farver_2.1.2           
[22] textshaping_1.0.0       codetools_0.2-20        litedown_0.9           
[25] fontquiver_0.2.1        fontLiberation_0.1.0    SnowballC_0.7.1        
[28] htmltools_0.5.9         yaml_2.3.10             crayon_1.5.3           
[31] pillar_1.10.1           openssl_2.3.2           fontBitstreamVera_0.1.1
[34] commonmark_2.0.0        stopwords_2.3           zip_2.3.2              
[37] tidyselect_1.2.1        digest_0.6.39           stringi_1.8.4          
[40] slam_0.1-55             labeling_0.4.3          rprojroot_2.1.1        
[43] fastmap_1.2.0           grid_4.4.2              cli_3.6.4              
[46] magrittr_2.0.3          patchwork_1.3.0         withr_3.0.2            
[49] gdtools_0.5.0           scales_1.4.0            timechange_0.3.0       
[52] rmarkdown_2.30          officer_0.7.3           ragg_1.3.3             
[55] askpass_1.2.1           hms_1.1.3               evaluate_1.0.3         
[58] knitr_1.51              markdown_2.0            rlang_1.1.7            
[61] Rcpp_1.1.1              glue_1.8.0              xml2_1.3.6             
[64] renv_1.1.7              rstudioapi_0.17.1       jsonlite_1.9.0         
[67] R6_2.6.1                systemfonts_1.3.1      
AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to substantially expand and restructure a shorter existing LADAL tutorial, adding the base R reference section, the full stringr coverage, str_glue and str_glue_data interpolation examples, the forcats section, string padding and formatting for table output, the encoding and Unicode section, the regular expressions section (including named capture groups and lookaround assertions), the text-cleaning pipelines section, and the expanded quanteda tokenisation section. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.



